Network monitoring method and system

ABSTRACT

A method of monitoring a network, in particular a data or telecommunication network, where the network includes a number of units to be monitored, and where relevant states of these units are controlled with respect to functionality, efficiency and/or security by means of information elements called managed objects provided at the monitored units, which reflect states and/or parameters to be monitored and which are read and/or written by a network management system using network management protocols, is characterized in that an additional managed object, the health check object, is implemented at the monitored units such that almost all relevant states of the respective monitored unit are aggregated into a single value that can be read by the network management system.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a method for monitoring a network, particularly a data or communication network, where the network includes a plurality of units to be monitored, and where relevant states of these monitored units are controlled with respect to functionality, efficiency and/or security by means of provided information elements called managed objects at the monitored units, that reflect states to be monitored, and that are read and/or written by a network management system using a predetermined network management protocol.

2. Description of the Related Art

Given the complexity and size of today's data and telecommunication networks, the importance of effective network management, particularly of effective network monitoring, is steadily increasing. For network management, many of today's data and telecommunication networks use a network management system (NMS) that communicates with a number of monitored units (MU's) in the network. Monitored units are devices of any kind, e.g. servers, hosts, routers, etc. For communication between NMS and MU, in general, standardized network management protocols may be used, for example the Simple Network Management Protocol (SNMP) that is common in IP (Internet Protocol)-based networks and the Common Management Information Protocol (CMIP) that is common in telecommunication networks.

As part of its network monitoring activities, the NMS reads or writes managed objects (MO's) at the monitored units and in turn the monitored units send notifications back to the NMS. A managed object MO is an information unit with clearly defined semantics, that is implemented as a passive memory cell at a MU and that directly corresponds to the MU. A managed object MO may be for example a counter, a string of text characters, or something else of this kind, which can indicate, for example, the current status of a communication link connected to the corresponding MU. By reading managed objects MO's from a particular monitored unit, the NMS can retrieve information about the current status of the monitored unit at which the MO's are implemented, for example on the status of a communication link. By writing to particular MO's, the NMS can change status or configuration of a MU, for example by setting the status of a communication link to ‘inactive.’

In order to achieve interoperability between NMS and monitored units of different manufacturers, MO's are standardized, for example in recommendations of the International Telecommunications Union (ITU) and in Requests for Comments (RFC's) of the Internet Engineering Task Force (IETF). In these standard documents, MO's are defined in a way that a NMS can receive sufficient and appropriately detailed information on the managed units. For example, there is a managed object MO indicating the link status for each link connected to the managed unit.

Monitoring communication networks in operation includes regularly checking network status and configuration. For this purpose, a pre-definable number of MO's must be read from each managed unit MU, where always the most current value must be read from the MU. For each of these MO's, an operation is required that checks whether the value of the individual MO (representing a relevant status with respect to functionality, efficiency and/or security of the network), or a combination of this value with values of other MO's, is within acceptable limits for normal operations. If one of the values exceeds the limit, then the NMS must become active in order to return to normal operations. The set of all operations for checking whether or not all values of MO's in a MU are within the acceptable limits for normal operations is called a ‘health check.’

In this context, it is problematic that checking all relevant MO's at all monitored units can cause scalability problems in the case where the number of monitored units becomes too large. The total number of MO's to be monitored is the product of the number of monitored units and the average number of MO's selected per monitored unit. A NMS has a limit for this total number; if the limit is exceeded, a network cannot be monitored in a sufficient way. Hence, for a fixed number of MO's to be monitored per monitored unit, the number of units to be monitored is limited.
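The relationship stated above can be illustrated with a short calculation. This is only a sketch: the function name and the figures used (an NMS capacity of 100,000 managed objects and 50 managed objects per unit) are hypothetical and do not come from any particular NMS.

```python
def max_monitored_units(nms_capacity: int, objects_per_unit: int) -> int:
    """Largest number of units an NMS can monitor.

    The total number of managed objects to be monitored is
    units * objects_per_unit, and it must not exceed nms_capacity.
    """
    if nms_capacity < 0 or objects_per_unit <= 0:
        raise ValueError("capacity must be >= 0 and objects_per_unit > 0")
    return nms_capacity // objects_per_unit


# Hypothetical figures, for illustration only.
capacity = 100_000
units = max_monitored_units(capacity, 50)
print(units)       # number of units that fit within the limit
print(units * 50)  # total number of managed objects monitored
```

With a fixed number of managed objects per unit, the number of units that can be monitored grows only if the NMS limit grows.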

A known approach to avoiding this problem is the reduction of MO's to be monitored per MU based on programmability of monitored units, which allows a NMS to load programs on monitored units. Such a program performs a ‘health check’ locally at a MU, either for the respective MU only or also for a limited number of further monitored units. This approach is called “Management by Delegation (MbD).” See German Goldszmidt and Yechiam Yemini, “Distributed Management by Delegation” (in Proceedings of the 15th International Conference on Distributed Computing Systems, June 1995).

Three technologies realizing the approach of Management by Delegation have been studied, implemented and standardized.

The ITU has developed the so-called Command Sequencer for telecommunication networks, which is documented in ITU-T recommendation X.753. The Command Sequencer allows loading complex programs on monitored units, provided that the programs are written in a specific programming language that is part of the standard.

The IETF has standardized a more flexible technology called Script MIB that is documented in IETF RFC 3165. It allows loading of programs written in arbitrary programming languages and for arbitrary runtime environments, as far as they are supported by the respective MU.

The IETF also has developed a simpler and functionally more restricted technology called Expression MIB that is documented in IETF RFC 2982. The Expression MIB allows a NMS to create simple expressions consisting of operations on MO's. These expressions can be used recursively for creating more complex expressions, for example a complete health check of a MU.

The above-described method based on Management by Delegation can reduce the number of MO's to be read, but at the same time, it has several drawbacks.

First, the manufacturing costs increase, because all monitored units must be extended by a program loader and a runtime environment for loaded programs. Furthermore, the complexity of the entire NMS increases, since programs for the local ‘health check’ must be provided in programming languages or for runtime environments that are available at the monitored units. Hence, several software components in different programming languages and/or for different runtime systems need to be developed and maintained.

Second, security problems arise in that an unauthorized person or system may load and start harmful programs. Accordingly, loading of arbitrary programs must be strictly controlled. Several security mechanisms are required for this, leading again to an increase of cost and complexity. More specifically, loading of programs on monitored units must be restricted and controlled. Access of running programs to MO's must be appropriately restricted and controlled. For example, write access to monitored MO's should be blocked. Furthermore, the runtime environment must be restricted such that programs cannot access other resources of monitored units.

The techniques of Command Sequencer and Script MIB as described above have been implemented by a few organizations, but they were never used to monitor large networks, because of the above-mentioned drawbacks.

The technique of Expression MIB as described above is less costly, less complex and easier to secure, compared with Script MIB and Command Sequencer, but the problems described above are not sufficiently reduced. This has prevented the technology from being deployed in large networks, with the result that the IETF no longer recommends implementing it.

SUMMARY OF THE INVENTION

An object of the present invention is to provide a network monitoring method and system, which can achieve network monitoring with high efficiency and security even in the case of large networks.

According to the present invention, in each monitored unit an additional managed object, the ‘health check’ object, is implemented, that unifies or aggregates all relevant states of the monitored unit into a single value that can be read by the network management system.

Since scalability can be increased cost-effectively and without introducing critical security concerns, large networks can be monitored efficiently.

In arriving at the invention, it was recognized that conventional approaches to monitoring of networks, particularly of large, complex network architectures, are not feasible because of scalability problems.

According to the present invention, the introduction of an additional managed object, the so-called ‘health check’ object, at the monitored units allows the NMS to read only a single managed object, namely the ‘health check’ object, instead of a large number of MO's.

According to the invention, the ‘health check’ object is implemented such that almost all relevant states of the respective monitored unit are unified or aggregated into a single value, that indicates the total result of the ‘health check’. The result is that significantly larger networks can be managed by a single NMS.

Also, the invention can provide remarkable improvement in security, compared to the conventional methods, because neither arbitrary programs nor expressions are transmitted. In contrast to technologies such as the Command Sequencer or the Script MIB, it is not possible according to the present invention for unauthorized persons or systems to, for example, load harmful programs onto a MU which affect the operation thereof.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram schematically showing the structure of a network monitoring system according to an embodiment of the present invention;

FIG. 2 is a schematic diagram showing the functional structure of processes and objects implemented in a monitored unit in the embodiment; and

FIG. 3 is a schematic diagram showing calculation of the health check object in the embodiment.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Referring to FIG. 1, a network management station (NMS) 101 manages a plurality of managed units MU₁ to MU_N in a network. Each managed unit MU is any kind of network element (NE) such as a server, host computer or router. Communication between NMS and each MU is made using a network management protocol (here, SNMP: Simple Network Management Protocol).

Each managed unit MU has a SNMP agent as a resident process implemented therein. The SNMP agent manages MIB objects including a health check object that will be described later and, in response to a request of a SNMP manager on the NMS 101, sends the health check object back to the SNMP manager.

As shown in FIG. 2, the SNMP agent 201 is implemented in the managed unit MU as a resident process generated by running a program on a processor (not shown). The SNMP agent 201 manages a health check object 202 and other managed objects 203 including statistics data and status variables as managed objects of that managed unit.

The health check object 202 checks the other managed objects 203 at regular intervals or as necessary, and the resulting health check value is sent back to the SNMP manager in response to the request.

In a case where some failures or status changes occur at the managed unit, a value calculated based on such an event is stored in the health check object 202 and then the SNMP agent 201 accesses the value stored in the health check object 202 to send it to the SNMP manager on the network management station (NMS) 101. The NMS 101 can identify the cause of the error by looking at the event-dependent value held in the health check object 202, so that the NMS 101 can reconfigure the managed unit.

Referring to FIG. 3, the NMS 101 sets check thresholds and managed object identifiers to be checked on a managed unit. A managed object value is compared to a corresponding check threshold and, when at least one managed object value has reached the corresponding threshold, the value of the health check object is set.
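The comparison just described can be sketched as follows. This is a minimal illustration, not the embodiment's actual implementation: the names `Check` and `health_check`, the managed object identifiers, and the convention that the health check object holds 1 when set and 0 otherwise are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Check:
    """One entry configured by the NMS (hypothetical structure)."""
    object_id: str    # identifier of the managed object to be checked
    threshold: float  # check threshold for that managed object


def health_check(mo_values: Dict[str, float], checks: List[Check]) -> int:
    """Return 1 if at least one managed object value has reached its
    threshold, otherwise 0 (the 0/1 encoding is an assumption)."""
    for check in checks:
        if mo_values[check.object_id] >= check.threshold:
            return 1
    return 0


# Hypothetical managed objects and thresholds.
values = {"ifInErrors": 3.0, "cpuLoad": 0.97}
checks = [Check("ifInErrors", 10.0), Check("cpuLoad", 0.90)]
print(health_check(values, checks))
```

The NMS then needs to read only the returned value rather than each managed object listed in `checks`.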

An advantageous implementation could set individual limits for each MO that is included in the computation of a ‘health check’, for example as maximum and minimum values and/or as a set of regular status values. Regular status values correspond to states at which functional, error-free operation of the network is indicated. In such a case, additional MO's can indicate the limits and/or status values. For an MO that, for example, indicates the status of a communication line by either the value ‘on’ or the value ‘off’, it is not necessary to indicate the set of regular status values. For such MO's, regular state values could be fixed and they could be standardized.

However, for most of the other MO's that are included in the ‘health check’ a flexible choice of limits and regular states is extremely useful. Particularly, limits can be implemented such that they can be chosen by a NMS. For performance monitoring for example, different operators typically choose different threshold values for the maximum load allowed on a communication link in order to fulfill different quality requirements.

With respect to a simple and efficient analysis of the results, a ‘health check’ object could be implemented such that it either indicates that all states at the monitored unit are within the regular limits or that at least one status exceeds the regular limits. In the latter case, the MO's of the respective monitored unit could be analyzed further, while in the former case no further operations are required.
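A two-valued health check with individual limits and regular status values, as discussed above, might look like the following sketch. The class `Limit`, the function `all_within_limits` and the example managed object names are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Union

Value = Union[float, str]


@dataclass(frozen=True)
class Limit:
    """Limits for one managed object (hypothetical structure)."""
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    regular_states: Optional[FrozenSet[str]] = None  # e.g. {"on"}

    def is_regular(self, value: Value) -> bool:
        # A status-type object is regular if its value is in the allowed set.
        if self.regular_states is not None:
            return value in self.regular_states
        # A numeric object is regular if it lies within [minimum, maximum].
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


def all_within_limits(values: Dict[str, Value], limits: Dict[str, Limit]) -> bool:
    """True if every checked managed object is within its regular limits."""
    return all(limit.is_regular(values[name]) for name, limit in limits.items())


limits = {
    "linkStatus": Limit(regular_states=frozenset({"on"})),
    "linkLoad": Limit(maximum=0.8),  # threshold chosen by the operator
}
print(all_within_limits({"linkStatus": "on", "linkLoad": 0.4}, limits))
print(all_within_limits({"linkStatus": "on", "linkLoad": 0.95}, limits))
```

Only when the result is false does the NMS need to analyze the individual managed objects of that unit.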

For achieving high flexibility and an easy and smooth adaptation to different requirements, a set of MO's could be selected, which are to be considered when computing the ‘health check’. This way, certain aspects of network monitoring could be emphasized based on the selection of MO's. For this purpose further MO's could be introduced, that indicate which MO's are to be considered in the ‘health check’. Some of these additional MO's could be static, others could be configurable by the NMS.

Flexibility and complexity of the computation of a health check object could be restricted and partially fixed in a way that is advantageous compared to known technologies. This would save cost and reduce the required computational power. The operations performed for computing the health check object could for example be restricted to comparison operations with threshold values or regular states. In this case, the NMS would just configure the set of MO's to be compared and the set of thresholds and/or states to which the MO's are compared. This would also restrict the arguments on which operations act: one argument would be the MO at a MU, the other argument would be one or more selectable threshold or status values to compare the MO to. Such a restriction would simplify network management, because only the values to compare to need to be known and specified.

A further improvement of the invention uses several health check objects at a single MU. This is particularly advantageous if several NMS's serving different purposes manage and monitor the same network. In many cases, a NMS for monitoring the network configuration is separated from a NMS for monitoring network performance and they monitor different sets of MO's at a MU. In such a case, each NMS could create its own health check object. Even if configuration monitoring and performance monitoring are performed by the same NMS, the NMS could create two different health check objects.

With respect to highly efficient network monitoring, health check objects could be hierarchically structured. Particularly if the health check at a MU consists of several groups of checks that can be clearly separated, then for each of these groups an individual health check object could be created. The total result of the entire health check could be represented by a single higher level health check object that includes only the values of lower level health check objects in its computation. The hierarchical decomposition of the health check object could be continued in a recursive way such that more than two hierarchy levels are created.
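The hierarchical structure described above can be sketched as a small tree in which a higher level object looks only at the values of its lower level objects. The class name `HealthCheckNode`, the group names and the rule that a higher level value is 1 if any lower level value is 1 are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HealthCheckNode:
    """A health check object in a hierarchy (hypothetical structure).

    A lowest level node evaluates its own checks; each check returns True
    when a limit is exceeded. A higher level node uses only the values of
    its lower level health check objects.
    """
    name: str
    checks: List[Callable[[], bool]] = field(default_factory=list)
    children: List["HealthCheckNode"] = field(default_factory=list)

    def value(self) -> int:
        if self.children:
            # Higher level: aggregate the lower level values only.
            return int(any(child.value() for child in self.children))
        return int(any(check() for check in self.checks))


# Three levels: unit -> (links, system) -> system has its own subgroups.
links = HealthCheckNode("links", checks=[lambda: False, lambda: False])
cpu = HealthCheckNode("cpu", checks=[lambda: True])
memory = HealthCheckNode("memory", checks=[lambda: False])
system = HealthCheckNode("system", children=[cpu, memory])
unit = HealthCheckNode("unit", children=[links, system])

print(unit.value(), links.value(), system.value())
```

Reading the top level object first and descending only into groups that report a problem keeps the number of read operations small.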

With respect to enhanced expressiveness of the conducted health check, weights could be assigned to the comparison operations that are performed for the individual MO's to be included in the computation of the health check object. The weights could reflect that for the MO's included in the health check the significance of exceeding the thresholds can be different. By introducing further managed objects a weight could be assigned to every comparison operation that is part of the health check. The value of the health check object could then be, for example, the maximum of all weights of MO's for which the value exceeds the assigned thresholds, or it could be the sum of all these weights. If no MO exceeds its thresholds, then this fact could be indicated by setting the value of the health check object to zero.
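The two weighting schemes mentioned above (maximum and sum) can be sketched as follows. The function name `weighted_health_check`, the tuple layout and the example weights are hypothetical; weights are assumed to be positive so that zero can unambiguously mean that no threshold is exceeded.

```python
from typing import Dict, List, Tuple

# (managed object name, threshold, weight); a hypothetical layout.
WeightedCheck = Tuple[str, float, int]


def weighted_health_check(
    values: Dict[str, float],
    checks: List[WeightedCheck],
    mode: str = "max",
) -> int:
    """Aggregate the weights of all comparisons whose threshold is exceeded.

    mode "max" returns the largest such weight, mode "sum" their total.
    Zero is returned when no managed object exceeds its threshold.
    """
    if mode not in ("max", "sum"):
        raise ValueError("mode must be 'max' or 'sum'")
    exceeded = [w for name, threshold, w in checks if values[name] > threshold]
    if not exceeded:
        return 0
    return max(exceeded) if mode == "max" else sum(exceeded)


checks = [("linkLoad", 0.8, 2), ("errorRate", 0.01, 5), ("temperature", 70.0, 3)]
values = {"linkLoad": 0.9, "errorRate": 0.001, "temperature": 75.0}
print(weighted_health_check(values, checks, "max"))
print(weighted_health_check(values, checks, "sum"))
```

With the maximum, the value reflects the most significant violation; with the sum, it also reflects how many violations occurred.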

As an alternative to the procedure in which the NMS periodically checks the value of a health check object, a MU could take an active role and perform the health check periodically with a given time interval. If the result of the check is a value indicating an abnormal condition, i.e. that at least one of the checked MO's exceeds its thresholds, the MU could send a notification to the NMS. A threshold for the value of the health check object could be specified such that a notification is sent if the value of the health check object exceeds the threshold.

In order to make it easy to identify a malfunction quickly after the health check has indicated its existence, a notification containing a hint to the malfunction could be sent, for example, together with the failed health check. This hint could for instance consist of the list of MO's with values exceeding the respective thresholds. With such information given, the NMS can initiate actions dealing with the fault more quickly and more appropriately. Without such information, the NMS would have to repeat the health check explicitly, i.e. it would have to read each MO included in the health check and check the MO's value. Therefore, it is advantageous in several cases to introduce further MO's indicating for which of the MO's included in the health check the comparison with thresholds failed. In such a case, the network management system NMS can quickly reconfigure the monitored units with the erroneous managed objects in order to return to fully functional operation.
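A monitored unit that performs the check itself and attaches a hint to its notification could be sketched as below. The function names, the dictionary used as notification payload and the 0/1 health check value are assumptions; in an SNMP setting the payload would be carried in a notification, which is not modeled here.

```python
from typing import Dict, List, Optional


def failed_objects(values: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Names of the managed objects whose value exceeds its threshold."""
    return sorted(name for name, limit in thresholds.items() if values[name] > limit)


def build_notification(
    unit_id: str,
    values: Dict[str, float],
    thresholds: Dict[str, float],
) -> Optional[dict]:
    """Return a notification payload if the health check fails, else None.

    The payload carries the failed health check value together with the
    list of managed objects that caused it, so the NMS does not have to
    repeat the health check object by object.
    """
    failed = failed_objects(values, thresholds)
    if not failed:
        return None  # everything is within limits, nothing to send
    return {"unit": unit_id, "health_check": 1, "failed_objects": failed}


thresholds = {"linkLoad": 0.8, "errorRate": 0.01}
print(build_notification("MU1", {"linkLoad": 0.5, "errorRate": 0.0}, thresholds))
print(build_notification("MU1", {"linkLoad": 0.9, "errorRate": 0.0}, thresholds))
```

In a periodic setup, the unit would call `build_notification` once per interval and send the result only when it is not `None`.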

In a particularly advantageous way, the range of MO's at a MU which are included in a health check could be extended beyond the scope of the MU to MO's of one or more other MU's. According to the concept of Management by Delegation, a single monitored unit can perform the health check for multiple monitored units, resulting in increased scalability. Such a hierarchical approach could be structured such that each MU uses a local health check object and that the MU that performs a joint health check of multiple MU's accesses the local health check object for computing the joint health check. Then the NMS could access the result of the joint health check with a single read access.

Furthermore, a restriction for the resources that are available for computing the value of a health check object could be established. This would effectively prevent unauthorized entities from maliciously setting the number of MO's compared for computing the value of a health check object so high that the MU would become overloaded and no longer able to sufficiently perform its original function. Alternatively or additionally to the resource restriction, the number of comparison operations could be limited to a maximum value that still allows regular operation.
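Limiting the number of comparison operations could be enforced when the check configuration is written, as in the following sketch. The constant `MAX_COMPARISONS`, its value of 64, the exception type and the function name are hypothetical choices for illustration.

```python
from typing import List, Tuple

# Hypothetical upper bound that still allows regular operation of the unit.
MAX_COMPARISONS = 64


class TooManyComparisonsError(ValueError):
    """Raised when a requested configuration exceeds the permitted limit."""


def configure_checks(
    requested: List[Tuple[str, float]],
    max_comparisons: int = MAX_COMPARISONS,
) -> List[Tuple[str, float]]:
    """Accept a list of (managed object, threshold) pairs only if the
    number of comparison operations stays within the permitted maximum."""
    if len(requested) > max_comparisons:
        raise TooManyComparisonsError(
            f"{len(requested)} comparisons requested, "
            f"at most {max_comparisons} permitted"
        )
    return list(requested)


accepted = configure_checks([("linkLoad", 0.8), ("errorRate", 0.01)])
print(len(accepted))
```

Rejecting an oversized configuration at write time keeps the cost of each subsequent health check bounded, regardless of who submitted the configuration.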

Finally, it is pointed out that there are different possibilities of embodying and further developing the teaching according to the invention in an advantageous way. In this context the reader is referred to the patent claims below.

1. A method of monitoring a network, the network including: a plurality of units to be monitored, each of which holds managed objects reflecting states to be monitored; and a network management system which is allowed to access a managed object of each of the plurality of monitored units through a predetermined network management protocol, the method comprising: implementing an additional managed object representing the states of a plurality of managed objects of a monitored unit in each of the plurality of monitored units, the additional managed object implemented by aggregating all the states of the plurality of managed objects into a single value; and storing the single value as a value of the additional managed object so as to be readable by the network management system.
2. The method according to claim 1, wherein, for the relevant states of the monitored unit, individual limits and/or regular state values are set.
3. The method according to claim 2, wherein the value of the additional managed object indicates one of cases where all relevant states at the monitored unit are within corresponding set limits and where at least one status thereof exceeds a corresponding set limit.
4. The method according to claim 1, wherein a set of managed objects is selected, which are used when the additional managed object is computed.
5. The method according to claim 4, wherein, for calculating the value of the additional managed object, only comparison operations with the set limits and/or regular state values are performed and all comparison results are aggregated into the value of the additional managed object.
6. The method according to claim 1, wherein multiple additional managed objects are implemented at a monitored unit.
7. The method according to claim 6, wherein different additional managed objects implemented at a monitored unit serve different aspects of network monitoring.
8. The method according to claim 6, wherein the additional managed objects implemented at a monitored unit are hierarchically structured into higher level and lower level additional managed objects.
9. The method according to claim 8, wherein, for computing the value of a higher level additional managed object, the values of its lower level health check objects are used.
10. The method according to claim 8, wherein additional managed objects are structured into more than two levels of hierarchy.
11. The method according to claim 5, wherein weights are assigned to the comparison operations that are performed when computing the value of the additional managed object.
12. The method according to claim 11, wherein the value of the additional managed object is computed as the maximum of the weights of all comparison operations for which the value of a managed object exceeds a corresponding limit that has been set.
13. The method according to claim 11, wherein the value of the additional managed object is computed as the sum of the weights of all compare operations for which the value of a managed object exceeds a corresponding limit that has been set.
14. The method according to claim 1, wherein the value of the additional managed object is computed in selectable regular intervals at the monitored unit.
15. The method according to claim 14, wherein, in case the computed value of an additional managed object indicates that at least one of the states exceeds its limit, the monitored unit sends a corresponding message to the network management system.
16. The method according to claim 15, wherein the message sent to the network management system contains an indication of the cause of fault.
17. The method according to claim 16, wherein the indication of the cause of fault is a list of managed objects for which the value exceeds its limits.
18. The method according to claim 17, wherein the monitored units with faulty managed objects are reconfigured by the network management system.
19. The method according to claim 1, wherein a scope of an additional managed object is extended to multiple monitored units by considering the values of managed objects at multiple monitored units when computing the value of the additional managed object.
20. The method according to claim 1, wherein limits are selected for resources to be used for computing the value of the additional managed object.
21. The method according to claim 5, wherein a maximum number of permitted comparison operations is restricted to a predetermined number.
22. A system for monitoring a network, comprising: a plurality of units to be monitored, each of which holds managed objects reflecting states to be monitored; and a network management system which is allowed to access a managed object of each of the plurality of monitored units through a predetermined network management protocol, wherein each of the plurality of monitored units stores a single value obtained representing the states of all the managed objects of the monitored unit, wherein the single value is accessible by the network management system.
23. A network element to be monitored by a network management system, comprising: a first storage for storing a plurality of managed objects reflecting states to be monitored; a second storage for storing at least one health check object represented by a single value obtained by aggregating the states of all the managed objects of the monitored unit; and a management agent for managing the plurality of managed objects and the at least one health check object, wherein the single value is accessible by the network management system.
24. A non-transitory computer-readable storage medium with a computer program stored thereon for instructing a computer to allow a network element to be monitored by a network management system, comprising: storing a plurality of managed objects reflecting states to be monitored; storing at least one health check object represented by a single value obtained by aggregating the states of all the managed objects of the monitored unit; and managing the plurality of managed objects and the at least one health check object so as to be accessible by the network management system.
25. The method according to claim 1, wherein the additional managed object in each of the plurality of monitored units, represents the aggregation of all the other managed objects of each of the plurality of monitored units.